Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): create separate ibis.FileDataset #842

Merged
merged 15 commits into from
Oct 11, 2024

Conversation

deepyaman
Copy link
Member

@deepyaman deepyaman commented Sep 20, 2024

Description

Resolves #828

Development notes

So far, I copied ibis.TableDataset, removing code paths for reading database tables, and adding support for file export.

Just wanted to put this out there for early feedback, but what else should be done?

  • Deprecate file I/O in TableDataset
  • Support versioning (I don't think there's any reason why this wouldn't work out of the box for FileDataset)
  • Docs stuff (add FileDataset to toctree, etc.)

Update: Versioning is actually not a trivial subject, because backends don't implement a consistent interface for checking whether a file exists. I plan to do this with PyArrow filesystem, but I will do that in a follow-up PR (to limit complexity added here); this PR handles local versioning.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@deepyaman deepyaman force-pushed the feat/datasets/ibis-filedataset branch 7 times, most recently from 88f74cb to c8f076e Compare September 24, 2024 04:13
@deepyaman deepyaman marked this pull request as ready for review September 24, 2024 05:04
Copy link
Contributor

@datajoely datajoely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great!

@deepyaman deepyaman mentioned this pull request Sep 30, 2024
@deepyaman deepyaman force-pushed the feat/datasets/ibis-filedataset branch from f5436cb to af42be8 Compare October 4, 2024 15:50
@deepyaman
Copy link
Member Author

One of the downsides of having a separate ibis.FileDataset is that the connection cache isn't shared, so you won't reuse an existing connection across ibis.FileDataset and ibis.TableDataset. Maybe this could be resolved by having the _connections class variable tied to a mixin these both inherit from, or, perhaps more simply, taking a Backend object and using a custom resolver to pass that.

I can explore this in a future PR; this is probably good enough for now.

@deepyaman deepyaman force-pushed the feat/datasets/ibis-filedataset branch 2 times, most recently from ad21e5a to fe56964 Compare October 7, 2024 00:12
@deepyaman deepyaman enabled auto-merge (squash) October 7, 2024 00:31
Copy link
Contributor

@datajoely datajoely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is 99% there, I'd prefer some extra bits of clarity so that @deepyaman 's bus factor isn't as high when it comes to maintainability

Very excited to get this in 💪

@deepyaman deepyaman force-pushed the feat/datasets/ibis-filedataset branch 2 times, most recently from ff0b85a to bce8c57 Compare October 10, 2024 12:28
Copy link
Contributor

@datajoely datajoely left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Refs: b7ff0c7

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman deepyaman force-pushed the feat/datasets/ibis-filedataset branch from 27a4f4b to b43163e Compare October 11, 2024 00:54
Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, it's a nice addition to the ibis datasets

@deepyaman deepyaman merged commit 3d36e57 into main Oct 11, 2024
12 checks passed
@deepyaman deepyaman deleted the feat/datasets/ibis-filedataset branch October 11, 2024 09:22
tdhooghe pushed a commit to tdhooghe/kedro-plugins that referenced this pull request Oct 21, 2024
* feat(datasets): create separate `ibis.FileDataset`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): deprecate `TableDataset` file I/O

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets): implement `FileDataset` versioning

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): try `os.path.exists`, for Windows

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* revert(datasets): use pathlib, ignore Windows test

Refs: b7ff0c7

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* docs(datasets): add `ibis.FileDataset` to contents

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets)!: expose `load` and `save` publicly

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): remove second filepath assignment

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
tdhooghe pushed a commit to tdhooghe/kedro-plugins that referenced this pull request Oct 21, 2024
* feat(datasets): create separate `ibis.FileDataset`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): deprecate `TableDataset` file I/O

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets): implement `FileDataset` versioning

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): try `os.path.exists`, for Windows

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* revert(datasets): use pathlib, ignore Windows test

Refs: b7ff0c7

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* docs(datasets): add `ibis.FileDataset` to contents

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): add docstring for `hashable` func

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* feat(datasets)!: expose `load` and `save` publicly

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore(datasets): remove second filepath assignment

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: tdhooghe <thomas_dhooghe@mckinsey.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add write support for ibis.TableDataset to files
3 participants